{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\nDeploy a Hugging Face Pruned Model on CPU\n=========================================\n**Author**: `Josh Fromm <https://github.com/jwfromm>`_\n\nThis tutorial demonstrates how to take any pruned model, in this case `PruneBert\nfrom Hugging Face\n<https://huggingface.co/huggingface/prunebert-base-uncased-6-finepruned-w-distil-squad>`_,\nand use TVM to leverage the model's sparsity support to produce real speedups. Although\nthe primary purpose of this tutorial is to realize speedups on already pruned\nmodels, it may also be useful to estimate how fast a model would be *if* it were\npruned. To this end, we also provide a function that takes an unpruned model and\nreplaces its weights\nwith random and pruned weights at a specified sparsity. This may be a useful\nfeature when trying to decide if a model is worth pruning or not.\n\nBefore we get into the code, it's useful to discuss sparsity and pruning\nand dig into the two\ndifferent types of sparsity: **structured** and **unstructured**.\n\nPruning is a technique primarily used to reduce the parameter size of a model\nby replacing weight values with 0s. Although many methods exist for choosing which\nweights should be set to 0, the most straight forward is by picking the \nweights with the smallest value. Typically, weights are pruned to a desired\nsparsity percentage. For example, a 95% sparse model would have only 5% of\nits weights non-zero. Pruning to very high sparsities often requires\nfinetuning or full retraining as it tends to be a lossy approximation.\nAlthough parameter size benefits are quite easy to obtain from a pruned model\nthrough simple compression, leveraging sparsity to yield runtime speedups\nis more complicated.\n\nIn structured sparsity weights are pruned with the goal of clustering\npruned weights together. In other words, they are pruned using both their\nvalue and location. The benefit of bunching up pruned weights is that it allows\nan algorithm such as matrix multiplication to skip entire blocks. It turns out\nthat some degree of *block sparsity* is very important to realizing significant\nspeedups on most hardware available today. \nThis is because when loading memory in most CPUs or GPUs, \nit doesn't save any work to skip reading a single value at a time, instead an entire\nchunk or tile is read in and executed using something like vectorized instructions.\n\nUnstructured sparse weights are those that are pruned only on the value of\nthe original weights. They may appear to be scattered randomly throughout\na tensor rather than in chunks like we'd see in block sparse weights.\nAt low sparsities, unstructured pruning techniques are difficult to\naccelerate. However, at high sparsities many blocks of all 0 values\nwill naturally appear, making it possible to accelerate.\n\nThis tutorial interacts with both structured and unstructured sparsity.\nHugging Face's PruneBert model is unstructured but 95% sparse, allowing us\nto apply TVM's block sparse optimizations to it, even if not optimally.\nWhen generating random sparse weights for an unpruned model, we do so with structured\nsparsity. A fun exercise is comparing the real speed of PruneBert with the block\nsparse speed using fake weights to see the benefit of structured sparsity.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Load Required Modules\n---------------------\nOther than TVM, scipy, the latest transformers, and\ntensorflow 2.2+ are required.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import os\nimport tvm\nimport time\nimport itertools\nimport numpy as np\nimport tensorflow as tf\nfrom tvm import relay, runtime\nfrom tvm.contrib import graph_executor\nfrom tvm.relay import data_dep_optimization as ddo\nfrom tensorflow.python.framework.convert_to_constants import (\n    convert_variables_to_constants_v2,\n)\nimport scipy.sparse as sp\n\n\n# Ask tensorflow to limit its GPU memory to what's actually needed\n# instead of gobbling everything that's available.\n# https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth\n# This way this tutorial is a little more friendly to sphinx-gallery.\ngpus = tf.config.list_physical_devices(\"GPU\")\nif gpus:\n    try:\n        for gpu in gpus:\n            tf.config.experimental.set_memory_growth(gpu, True)\n        print(\"tensorflow will use experimental.set_memory_growth(True)\")\n    except RuntimeError as e:\n        print(\"experimental.set_memory_growth option is not available: {}\".format(e))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Configure Settings\n------------------\nLet's start by defining some parameters that define the type of model\nand sparsity to run.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# The name of the transformer model to download and run.\nname = \"huggingface/prunebert-base-uncased-6-finepruned-w-distil-squad\"\n# The number of batches in an input.\nbatch_size = 1\n# The length of each input sequence.\nseq_len = 128\n# TVM platform identifier. Note that best cpu performance can be achieved by setting -mcpu\n# appropriately for your specific machine. CUDA and ROCm are also supported.\ntarget = \"llvm\"\n# Which device to run on. Should be one of tvm.cpu() or tvm.cuda().\ndev = tvm.cpu()\n# If true, then a sparse variant of the network will be run and\n# benchmarked.\nmeasure_sparse = True\n# The block size of structured sparsity to convert weight tensors\n# into. Changing this parameter may yield speedups for some platforms.\nbs_r = 1\n# For models besides PruneBert (which is 95% sparse), this parameter\n# determines how sparse the generated weights should be. The higher\n# the sparsity, the faster the result.\nsparsity = 0.85"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Download and Convert Transformers Model\n---------------------------------------\nNow we'll grab a model from the transformers module, download it,\nconvert it into a TensorFlow graphdef in preperation for converting that graphdef into\na relay graph that we can optimize and deploy.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def load_keras_model(module, name, seq_len, batch_size, report_runtime=True):\n    model = module.from_pretrained(name)\n    dummy_input = tf.keras.Input(shape=[seq_len], batch_size=batch_size, dtype=\"int32\")\n    dummy_out = model(dummy_input)  # Propagate shapes through the keras model.\n    if report_runtime:\n        np_input = np.random.uniform(size=[batch_size, seq_len], low=0, high=seq_len).astype(\n            \"int32\"\n        )\n        start = time.time()\n        repeats = 50\n        for i in range(repeats):\n            np_out = model(np_input)\n        end = time.time()\n        print(\"Keras Runtime: %f ms.\" % (1000 * ((end - start) / repeats)))\n    return model\n\n\ndef convert_to_graphdef(model, batch_size, seq_len):\n    model_func = tf.function(lambda x: model(x))\n    input_dict = model._saved_model_inputs_spec\n    input_spec = input_dict[list(input_dict.keys())[0]]\n    model_func = model_func.get_concrete_function(\n        tf.TensorSpec([batch_size, seq_len], input_spec.dtype)\n    )\n    frozen_func = convert_variables_to_constants_v2(model_func)\n    return frozen_func.graph.as_graph_def()\n\n\ndef download_model(name, batch_size, seq_len):\n    import transformers\n\n    module = getattr(transformers, \"TFBertForSequenceClassification\")\n    model = load_keras_model(module, name=name, batch_size=batch_size, seq_len=seq_len)\n    return convert_to_graphdef(model, batch_size, seq_len)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Convert to Relay Graph\n----------------------\nWe now have all the tooling to get a transformers model in the right format\nfor relay conversion. Let's import it! In the following function we\nsave the imported graph in relay's json format so that we dont have\nto reimport from tensorflow each time this script is run.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def import_graphdef(\n    name,\n    batch_size,\n    seq_len,\n    save_relay=True,\n    relay_file=\"model.json\",\n    relay_params=\"model.params\",\n):\n    abs_path = os.path.dirname(os.path.abspath(__file__))\n    shape_dict = {\"input_1\": (batch_size, seq_len)}\n    relay_file = (\"%s_%d_%d_%s\" % (name, batch_size, seq_len, relay_file)).replace(\"/\", \"_\")\n    relay_params = (\"%s_%d_%d_%s\" % (name, batch_size, seq_len, relay_params)).replace(\"/\", \"_\")\n    if os.path.exists(os.path.join(abs_path, relay_file)) and os.path.exists(\n        os.path.join(abs_path, relay_params)\n    ):\n        with open(os.path.join(abs_path, relay_file), \"r\") as fi:\n            mod = tvm.ir.load_json(fi.read())\n        with open(os.path.join(abs_path, relay_params), \"rb\") as fi:\n            params = relay.load_param_dict(fi.read())\n    else:\n        graph_def = download_model(name, batch_size, seq_len)\n\n        mod, params = relay.frontend.from_tensorflow(graph_def, shape=shape_dict)\n\n        if save_relay:\n            with open(os.path.join(abs_path, relay_file), \"w\") as fo:\n                fo.write(tvm.ir.save_json(mod))\n            with open(os.path.join(abs_path, relay_params), \"wb\") as fo:\n                fo.write(runtime.save_param_dict(params))\n\n    return mod, dict(params.items()), shape_dict"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Run the Dense Graph\n-------------------\nLet's run the default version of the imported model. Note that even if\nthe weights are sparse, we won't see any speedup because we are using\nregular dense matrix multiplications on these dense (but mostly zero)\ntensors instead of sparse aware kernels.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def run_relay_graph(mod, params, shape_dict, target, dev):\n    with relay.build_config(opt_level=3):\n        lib = relay.build(mod, target=target, params=params)\n    input_shape = shape_dict[\"input_1\"]\n    dummy_data = np.random.uniform(size=input_shape, low=0, high=input_shape[1]).astype(\"int32\")\n\n    m = graph_executor.GraphModule(lib[\"default\"](dev))\n    m.set_input(0, dummy_data)\n    m.run()\n    tvm_output = m.get_output(0)\n\n    print(m.benchmark(dev, repeat=5, number=5))\n    return tvm_output\n\n\ndef run_dense(mod, params, shape_dict, target, dev):\n    print(\"Dense Model Benchmark:\")\n    return run_relay_graph(mod, params, shape_dict, target, dev)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Run the Sparse Graph\n--------------------\nNext we'll convert the graph into a sparse representation and generate\nfake sparse weights if needed. Then we'll use the same benchmarking\nscript as dense to see how much faster we go! We apply a few relay passes\nto the graph to get it leveraging sparsity. First we use\n`simplify_fc_transpose` to use transposes on the weights of dense layers\ninto the parameters. This makes it easier to convert to matrix multiplies\nto sparse versions. Next we apply `bsr_dense.convert` to identify all\nweight matrices that can be sparse, and automatically replace them.\n\nThe `bsr_dense.convert` call below is doing the heavy lifting of identifying\nwhich weights in the model can be made sparse by checking if they are\nat least `sparsity_threshold` percent sparse. If so, it converts those\nweights into *Block Compressed Row Format (BSR)*. BSR is essentially\na representation that indexes into the nonzero chunks of the tensor,\nmaking it easy for an algorithm to load those non-zero chunks and ignore\nthe rest of the tensor. Once the sparse weights are in BSR format,\n`relay.transform.DenseToSparse` is applied to actually replace\n`relay.dense` operations with `relay.sparse_dense` calls that can be\nrun faster.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def random_bsr_matrix(M, N, BS_R, BS_C, density, dtype=\"float32\"):\n    Y = np.zeros((M, N), dtype=dtype)\n    assert M % BS_R == 0\n    assert N % BS_C == 0\n    nnz = int(density * M * N)\n    num_blocks = int(nnz / (BS_R * BS_C)) + 1\n    candidate_blocks = np.asarray(list(itertools.product(range(0, M, BS_R), range(0, N, BS_C))))\n    assert candidate_blocks.shape[0] == M // BS_R * N // BS_C\n    chosen_blocks = candidate_blocks[\n        np.random.choice(candidate_blocks.shape[0], size=num_blocks, replace=False)\n    ]\n    for i in range(len(chosen_blocks)):\n        r, c = chosen_blocks[i]\n        Y[r : r + BS_R, c : c + BS_C] = np.random.uniform(-0.1, 0.1, (BS_R, BS_C))\n    s = sp.bsr_matrix(Y, blocksize=(BS_R, BS_C))\n    assert s.data.shape == (num_blocks, BS_R, BS_C)\n    assert s.data.size >= nnz\n    assert s.indices.shape == (num_blocks,)\n    assert s.indptr.shape == (M // BS_R + 1,)\n    return s.todense()\n\n\ndef random_sparse_bert_params(func, params, density, BS_R, BS_C):\n    def deepcopy(param_dic):\n        ret = {}\n        for k, v in param_dic.items():\n            ret[k] = tvm.nd.array(v.numpy())\n        return ret\n\n    new_params = deepcopy(params)\n    dense_weight_names = relay.analysis.sparse_dense._search_dense_op_weight(func)\n    for item in dense_weight_names:\n        name = str(item)\n        shape = new_params[name].shape\n        if shape[0] % BS_R == 0 and shape[1] % BS_C == 0:\n            new_w = random_bsr_matrix(shape[0], shape[1], BS_R, BS_C, density)\n            new_params[name] = tvm.nd.array(new_w)\n    return new_params\n\n\ndef run_sparse(mod, params, shape_dict, target, dev, bs_r, sparsity, gen_weights):\n    mod, params = ddo.simplify_fc_transpose.convert(mod[\"main\"], params)\n    if gen_weights:\n        params = random_sparse_bert_params(mod, params, BS_R=bs_r, BS_C=1, density=1 - sparsity)\n    mod, params = ddo.bsr_dense.convert(mod, params, (bs_r, 1), sparsity_threshold=0.8)\n    print(\"Block Sparse Model with {blocksize}x1 blocks:\".format(blocksize=bs_r))\n    return run_relay_graph(mod, params, shape_dict, target, dev)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Run All the Code!\n-----------------\nAnd that's it! Now we'll simply call all the needed function to benchmark\nthe model according to the set parameters. Note that to run this code\nyou'll need to uncomment the last line first.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def benchmark():\n    mod, params, shape_dict = import_graphdef(name, batch_size, seq_len)\n    run_dense(mod, params, shape_dict, target, dev)\n    if measure_sparse:\n        gen_weights = \"prune\" not in name\n        run_sparse(mod, params, shape_dict, target, dev, bs_r, sparsity, gen_weights)\n\n\n# benchmark()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Sample Output\n-------------\nFor reference, below is the output of the script when run on an AMD CPU\nand shows about a 2.5X speedup from using sparsity.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Dense Model Benchmark:\n# Cannot find config for target=llvm, workload=('dense_nopack.x86', ('TENSOR', (1, 768), 'float32'), ('TENSOR', (2, 768), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.\n# Cannot find config for target=llvm, workload=('dense_nopack.x86', ('TENSOR', (1, 768), 'float32'), ('TENSOR', (768, 768), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.\n# Cannot find config for target=llvm, workload=('dense_nopack.x86', ('TENSOR', (128, 3072), 'float32'), ('TENSOR', (768, 3072), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.\n# Cannot find config for target=llvm, workload=('dense_nopack.x86', ('TENSOR', (128, 768), 'float32'), ('TENSOR', (3072, 768), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.\n# Cannot find config for target=llvm, workload=('dense_nopack.x86', ('TENSOR', (128, 768), 'float32'), ('TENSOR', (768, 768), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.\n# Cannot find config for target=llvm, workload=('batch_matmul.x86', ('TENSOR', (12, 128, 128), 'float32'), ('TENSOR', (12, 64, 128), 'float32')). A fallback configuration is used, which may bring great performance regression.\n# Cannot find config for target=llvm, workload=('batch_matmul.x86', ('TENSOR', (12, 128, 64), 'float32'), ('TENSOR', (12, 128, 64), 'float32')). A fallback configuration is used, which may bring great performance regression.\n# Runtime:             165.26 ms           (12.83 ms)\n# Block Sparse Model with 1x1 blocks:\n# Runtime:             67.75 ms            (8.83 ms)\n\n# Here is the output of this script on a GPU (GTX 1070) with the target \"cuda -libs=cublas\".\n#\n# Dense Model Benchmark:\n# Cannot find config for target=cuda -keys=cuda,gpu -libs=cublas -max_num_threads=1024 -thread_warp_size=32, workload=('dense_cublas.cuda', ('TENSOR', (1, 768), 'float32'), ('TENSOR', (2, 768), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.\n# Cannot find config for target=cuda -keys=cuda,gpu -libs=cublas -max_num_threads=1024 -thread_warp_size=32, workload=('dense_cublas.cuda', ('TENSOR', (1, 768), 'float32'), ('TENSOR', (768, 768), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.\n# Cannot find config for target=cuda -keys=cuda,gpu -libs=cublas -max_num_threads=1024 -thread_warp_size=32, workload=('dense_cublas.cuda', ('TENSOR', (128, 3072), 'float32'), ('TENSOR', (768, 3072), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.\n# Cannot find config for target=cuda -keys=cuda,gpu -libs=cublas -max_num_threads=1024 -thread_warp_size=32, workload=('dense_cublas.cuda', ('TENSOR', (128, 768), 'float32'), ('TENSOR', (3072, 768), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.\n# Cannot find config for target=cuda -keys=cuda,gpu -libs=cublas -max_num_threads=1024 -thread_warp_size=32, workload=('dense_cublas.cuda', ('TENSOR', (128, 768), 'float32'), ('TENSOR', (768, 768), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.\n# Cannot find config for target=cuda -keys=cuda,gpu -libs=cublas -max_num_threads=1024 -thread_warp_size=32, workload=('batch_matmul_cublas.cuda', ('TENSOR', (12, 128, 128), 'float32'), ('TENSOR', (12, 64, 128), 'float32'), (12, 128, 64)). A fallback configuration is used, which may bring great performance regression.\n# Cannot find config for target=cuda -keys=cuda,gpu -libs=cublas -max_num_threads=1024 -thread_warp_size=32, workload=('batch_matmul_cublas.cuda', ('TENSOR', (12, 128, 64), 'float32'), ('TENSOR', (12, 128, 64), 'float32'), (12, 128, 128)). A fallback configuration is used, which may bring great performance regression.\n# Runtime:             10.64 ms            (0.29 ms)\n# Block Sparse Model with 1x1 blocks:\n# Runtime:             6.46 ms             (0.05 ms)"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}